Building the Customer and Product Models
1 Load Data
We first want to load our datasets and prepare them for some simple association rules mining.
1.1 Load Transaction Data
tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")
tnx_data_tbl %>% glimpse()
## Rows: 1,067,371
## Columns: 23
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <fct> December, December, December, December, December, De…
## $ invoice_dow <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
For the rules mining we only need the invoice data and the stock code, so we can ignore the remaining columns. We also set aside the issue of returns and look only at purchases.
tnx_purchase_tbl <- tnx_data_tbl %>%
filter(
quantity > 0,
price > 0,
exclude == FALSE
) %>%
drop_na(customer_id) %>%
select(
invoice_id, invoice_date, stock_code, customer_id, quantity, price,
stock_value, description
)
tnx_purchase_tbl %>% glimpse()
## Rows: 802,651
## Columns: 8
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, …
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…
To build the association rules, we need to load the transactions in the format
required by the arules package. We set a cutoff date for the dataset:
transactions on or before this date are used to train the model, and the
remainder are held out for validation.
training_data_date <- as.Date("2011-03-31")
We now combine all this data to construct our association rules.
tnx_purchase_tbl %>%
filter(invoice_date <= training_data_date) %>%
select(invoice_id, stock_code) %>%
write_csv("data/tnx_arules_input.csv")
basket_tnxdata <- read.transactions(
file = "data/tnx_arules_input.csv",
format = "single",
sep = ",",
header = TRUE,
cols = c("invoice_id", "stock_code")
)
basket_tnxdata %>% glimpse()
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## ..@ itemInfo :'data.frame': 4091 obs. of 1 variable:
## .. ..$ labels: chr [1:4091] "10002" "10080" "10109" "10120" ...
## ..@ itemsetInfo:'data.frame': 22873 obs. of 1 variable:
## .. ..$ transactionID: chr [1:22873] "489434" "489435" "489436" "489437" ...
1.2 Load Customer Data
We also want to load the various data about the customers such as their cohort and so on.
customer_cohort_tbl <- read_rds("data/customer_cohort_tbl.rds")
customer_cohort_tbl %>% glimpse()
## Rows: 5,852
## Columns: 5
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", …
## $ cohort_qtr <chr> "2010 Q1", "2010 Q4", "2010 Q3", "2010 Q2", "2011 Q1",…
## $ cohort_ym <chr> "2010 03", "2010 10", "2010 09", "2010 04", "2011 02",…
## $ first_tnx_date <date> 2010-03-02, 2010-10-31, 2010-09-27, 2010-04-29, 2011-…
## $ total_tnx_count <int> 3, 8, 5, 3, 1, 1, 9, 2, 1, 2, 6, 2, 5, 10, 6, 4, 10, 2…
1.3 Load Product Data
We also want to load the free-text description of the various stock items, as this will help with the interpretation of the various rules we have found.
product_data_tbl <- read_rds("data/stock_description_tbl.rds")
product_data_tbl %>% glimpse()
## Rows: 4,725
## Columns: 2
## $ stock_code <chr> "10002", "10002R", "10080", "10109", "10120", "10123C", "10…
## $ desc <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "GRO…
2 Build Association Rules Model
We now build our association rules using a lower support threshold.
The idea is to repeat some of the initial association rules analysis: we use the APRIORI algorithm to mine the rules, and then convert the discovered rules to produce a graph of the products and the rules.
We then use the disjoint components of this graph to cluster the products, and partition the largest component further using standard graph-clustering algorithms.
2.1 Construct Association Rules
Having loaded the individual transaction data we now construct our basket data and use the APRIORI algorithm to discover our rules.
basket_arules <- apriori(
basket_tnxdata,
parameter = list(supp = 0.005, conf = 0.01)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.01 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 114
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4091 item(s), 22873 transaction(s)] done [0.17s].
## sorting and recoding items ... [1146 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [6223 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
basket_arules_tbl <- basket_arules %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_arules_tbl %>% glimpse()
## Rows: 6,223
## Columns: 6
## $ rules <chr> "{22459} => {22458}", "{22458} => {22459}", "{22520,22521,2…
## $ support <dbl> 0.005639837, 0.005639837, 0.005945875, 0.005071482, 0.00529…
## $ confidence <dbl> 0.8835616, 0.8600000, 0.9510490, 0.9508197, 0.9453125, 0.94…
## $ coverage <dbl> 0.006383072, 0.006557950, 0.006251913, 0.005333800, 0.00559…
## $ lift <dbl> 134.73137, 134.73137, 105.59875, 105.57329, 104.96181, 104.…
## $ count <int> 129, 129, 136, 116, 121, 137, 131, 131, 117, 136, 121, 142,…
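As a reminder of what the support, confidence and lift columns mean, here is a minimal base-R sketch computing them for a single toy rule. The baskets below are invented for illustration; the arules package computes these measures for us in practice.

```r
# Toy basket data: each element is the set of stock codes on one invoice.
baskets <- list(
  c("22458", "22459"),
  c("22458", "22459", "85048"),
  c("85048", "22041"),
  c("22458", "21232"),
  c("22459", "22458", "21232")
)

# Fraction of baskets containing every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# For the rule {22459} => {22458}:
supp_rule <- support(c("22459", "22458"))  # P(LHS and RHS together)
conf_rule <- supp_rule / support("22459")  # P(RHS | LHS)
lift_rule <- conf_rule / support("22458")  # confidence relative to P(RHS)

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift above 1 indicates the two items co-occur more often than independence would suggest, which is why we sort the mined rules by lift.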
Having constructed the main association rules, we then convert the discovered rules into a graph.
apriori_rules_igraph <- basket_arules %>%
plot(
measure = "support",
method = "graph",
engine = "igraph",
control = list(max = 20000)
)
apriori_rules_igraph %>% summary()
## IGRAPH d739801 DN-B 6882 14757 --
## + attr: name (v/c), label (v/c), type (v/n), support (v/n), confidence
## | (v/n), coverage (v/n), lift (v/n), count (v/n), order (v/n)
Having constructed the graph, we now want to visualise it.
basket_arules %>%
head(n = 500, by = "support") %>%
plot(
measure = "lift",
method = "graph",
engine = "htmlwidget"
)
2.2 Determine Graph Clusters
With the constructed graph we now want to label the elements that are part of the disjoint components of the graph.
apriori_rules_tblgraph <- apriori_rules_igraph %>%
igraph::as.undirected(mode = "collapse") %>%
as_tbl_graph() %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()
apriori_rules_tblgraph %>% print()
## # A tbl_graph: 6882 nodes and 14757 edges
## #
## # A bipartite simple graph with 217 components
## #
## # Node Data: 6,882 × 11 (active)
## name label type support confidence coverage lift count order component_id
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 1 10002 1 NA NA NA NA NA NA 53
## 2 17 15036 1 NA NA NA NA NA NA 54
## 3 23 1505… 1 NA NA NA NA NA NA 1
## 4 24 1505… 1 NA NA NA NA NA NA 1
## 5 25 1505… 1 NA NA NA NA NA NA 1
## 6 56 1615… 1 NA NA NA NA NA NA 55
## # … with 6,876 more rows, and 1 more variable: component_size <int>
## #
## # Edge Data: 14,757 × 2
## from to
## <int> <int>
## 1 535 660
## 2 577 661
## 3 71 662
## # … with 14,754 more rows
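The component labelling performed by group_components() can be sketched with a small base-R union-find on a toy edge list; node and edge names here are invented for illustration.

```r
# Toy undirected edge list: two separate chains A-B-C and D-E-F.
edges <- data.frame(
  from = c("A", "B", "D", "E"),
  to   = c("B", "C", "E", "F")
)
nodes <- unique(c(edges$from, edges$to))

# Union-find: every node starts as its own parent.
parent <- setNames(nodes, nodes)
find <- function(x) {
  while (parent[[x]] != x) x <- parent[[x]]
  x
}
# Merge the two endpoints of each edge into one component.
for (i in seq_len(nrow(edges))) {
  parent[[find(edges$from[i])]] <- find(edges$to[i])
}

component_id <- vapply(nodes, find, character(1))
split(nodes, component_id)
```

Nodes sharing a root belong to the same disjoint component, which is exactly the grouping we use to cut the rules graph into product clusters.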
From the graph, we extract the nodes that correspond to the products (as opposed to the nodes corresponding to the mined association rules). Product nodes are identified by the fact that the numeric rule measures (support, confidence, and so on) are NA for them.
We also wish to add an additional column that is the size of the group, so it is easier to identify outsized subgraphs suitable for further partitioning.
product_cluster_disjoint_tbl <- apriori_rules_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(component_id) %>%
mutate(
cluster_size = n()
) %>%
ungroup() %>%
arrange(desc(cluster_size), label) %>%
group_by(component_id) %>%
mutate(
product_group_id = sprintf("AR_DISJOINT_%03d", cur_group_id()),
cluster_size,
stock_code = label
) %>%
ungroup() %>%
select(product_group_id, cluster_size, stock_code) %>%
arrange(product_group_id, stock_code)
product_cluster_disjoint_tbl %>% glimpse()
## Rows: 659
## Columns: 3
## $ product_group_id <chr> "AR_DISJOINT_001", "AR_DISJOINT_001", "AR_DISJOINT_00…
## $ cluster_size <int> 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365…
## $ stock_code <chr> "15056BL", "15056N", "15056P", "20674", "20675", "206…
We now segment the largest disjoint subgraph using alternative clustering techniques.
We try a few different algorithms, inspecting the output of each to see which clustering may be the most useful.
run_subgraph_clusters <- function(graph_cluster_func, rules_tblgraph, ...) {
subgraph_clusters_tbl <- rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
morph(to_undirected) %>%
mutate(
sub_id = graph_cluster_func(...)
) %>%
unmorph() %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
count(sub_id, name = "cluster_size", sort = TRUE) %>%
mutate(
sub_id = factor(1:n(), levels = 1:n())
)
return(subgraph_clusters_tbl)
}
cluster_func <- c(
"group_fast_greedy",
"group_infomap",
"group_label_prop",
"group_louvain",
"group_spinglass"
)
cluster_data_tbl <- tibble(cluster_func_name = cluster_func) %>%
mutate(
cluster_func = map(cluster_func_name, get),
clustered = map(cluster_func, run_subgraph_clusters,
rules_tblgraph = apriori_rules_tblgraph)
) %>%
select(cluster_func_name, clustered) %>%
unnest(clustered)
cluster_data_tbl %>% glimpse()
## Rows: 406
## Columns: 3
## $ cluster_func_name <chr> "group_fast_greedy", "group_fast_greedy", "group_fas…
## $ sub_id <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ cluster_size <int> 84, 41, 29, 27, 23, 21, 20, 18, 16, 15, 13, 12, 9, 9…
Having split this largest component into various splits, we now visualise the count and size of each cluster and use this to determine which clustering splits the data into a smaller number of larger clusters.
ggplot(cluster_data_tbl) +
geom_col(aes(x = sub_id, y = cluster_size)) +
geom_hline(aes(yintercept = 5), colour = "red") +
facet_wrap(vars(cluster_func_name), scales = "free") +
labs(
x = "ID",
y = "Cluster Size"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8))
plot_tbl <- cluster_data_tbl %>%
group_by(cluster_func_name) %>%
count(cluster_size, name = "cluster_count", sort = TRUE) %>%
ungroup() %>%
mutate(cluster_size = as.factor(cluster_size))
ggplot(plot_tbl) +
geom_col(aes(x = cluster_size, y = cluster_count, group = cluster_size)) +
facet_wrap(vars(cluster_func_name), scales = "free") +
labs(
x = "Cluster Size",
y = "Community Count",
title = "Visualisation of Spread of Cluster Sizes"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
From this, it appears that the Louvain method is the best choice.
Thus, we re-run the clustering for this larger component using the chosen algorithm and use this to create our various product groups.
subgraph_groups_tbl <- apriori_rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
morph(to_undirected) %>%
mutate(
sub_id = group_louvain()
) %>%
unmorph() %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(sub_id) %>%
mutate(
cluster_size = n()
) %>%
ungroup() %>%
arrange(desc(cluster_size), label) %>%
group_by(sub_id) %>%
mutate(
product_group_id = sprintf("AR_LARGE_%03d", cur_group_id()),
cluster_size,
stock_code = label
) %>%
ungroup() %>%
select(product_group_id, cluster_size, stock_code) %>%
arrange(product_group_id, stock_code)
subgraph_groups_tbl %>% glimpse()
## Rows: 365
## Columns: 3
## $ product_group_id <chr> "AR_LARGE_001", "AR_LARGE_001", "AR_LARGE_001", "AR_L…
## $ cluster_size <int> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ stock_code <chr> "20711", "20712", "20713", "21928", "21929", "21930",…
We now combine both lists of groupings, excluding the largest disjoint component as it has been re-clustered above.
product_cluster_tbl <- list(
product_cluster_disjoint_tbl,
subgraph_groups_tbl
) %>%
bind_rows() %>%
filter(product_group_id != "AR_DISJOINT_001")
product_cluster_tbl %>% glimpse()
## Rows: 659
## Columns: 3
## $ product_group_id <chr> "AR_DISJOINT_002", "AR_DISJOINT_002", "AR_DISJOINT_00…
## $ cluster_size <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ stock_code <chr> "21668", "21669", "21670", "21671", "21672", "21673",…
2.3 Assign Products to Groups
We now want to look at our complete list of products and assign each one to a product group. In terms of coverage, we want to check whether the products appearing in the most invoices have been assigned to a group.
We also want to look at the most commonly purchased items (in terms of appearance in baskets as opposed to quantity sold).
product_popular_tbl <- tnx_purchase_tbl %>%
mutate(
stock_code = str_to_upper(stock_code)
) %>%
count(stock_code, name = "invoice_count", sort = TRUE)
product_popular_tbl %>% glimpse()
## Rows: 4,621
## Columns: 2
## $ stock_code <chr> "85123A", "22423", "85099B", "84879", "20725", "21212", …
## $ invoice_count <int> 5188, 3428, 3360, 2777, 2678, 2654, 2141, 2121, 2117, 21…
We now combine this data to construct a product dataset containing the relevant summary data about each product.
product_data_full_tbl <- product_data_tbl %>%
left_join(product_cluster_tbl, by = "stock_code") %>%
left_join(product_popular_tbl, by = "stock_code") %>%
replace_na(
list(product_group_id = "none", cluster_size = "0")
) %>%
arrange(desc(invoice_count)) %>%
mutate(ranking = 1:n()) %>%
semi_join(tnx_purchase_tbl, by = "stock_code") %>%
arrange(stock_code)
product_data_full_tbl %>% glimpse()
## Rows: 4,621
## Columns: 6
## $ stock_code <chr> "10002", "10080", "10109", "10120", "10123C", "10123G…
## $ desc <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATAB…
## $ product_group_id <chr> "AR_DISJOINT_053", "none", "none", "none", "none", "n…
## $ cluster_size <chr> "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ invoice_count <int> 317, 27, 1, 66, 50, 13, 19, 10, 133, 212, 52, 258, 16…
## $ ranking <int> 718, 3160, 4501, 2336, 2601, 3699, 3448, 3861, 1589, …
First, let us display the table as an interactive datatable to help us inspect the data.
product_data_full_tbl %>% datatable()
To make it more obvious, we look at the products unassigned to a group and see how they rank in terms of invoice count.
product_data_full_tbl %>% filter(product_group_id == "none") %>% datatable()
3 Construct Transaction-Based Graph Clusters
We can treat the transaction data as a graph, turning both invoices and stock items into nodes on the graph, and create an edge between stock and invoices when the item occurs on the invoice.
We then cluster the graph to create groupings for the different stock codes.
stock_nodes_tbl <- tnx_purchase_tbl %>%
select(stock_code) %>%
distinct() %>%
transmute(node_label = stock_code, node_type = "stock")
invoice_nodes_tbl <- tnx_purchase_tbl %>%
select(invoice_id) %>%
distinct() %>%
transmute(node_label = invoice_id, node_type = "invoice")
nodes_tbl <- list(stock_nodes_tbl, invoice_nodes_tbl) %>%
bind_rows()
edges_tbl <- tnx_purchase_tbl %>%
select(stock_code, invoice_id, quantity, price)
basket_tblgraph <- tbl_graph(
nodes = nodes_tbl,
edges = edges_tbl,
directed = FALSE,
node_key = "node_label"
)
3.1 Check Graph Clustering Approaches
First we perform our basic clustering by splitting off the different disjoint components of the graph.
basket_tblgraph <- basket_tblgraph %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()
basket_tblgraph %>% print()
## # A tbl_graph: 41215 nodes and 802651 edges
## #
## # An undirected multigraph with 6 components
## #
## # Node Data: 41,215 × 4 (active)
## node_label node_type component_id component_size
## <chr> <chr> <int> <int>
## 1 85048 stock 1 41202
## 2 79323P stock 1 41202
## 3 79323W stock 1 41202
## 4 22041 stock 1 41202
## 5 21232 stock 1 41202
## 6 22064 stock 1 41202
## # … with 41,209 more rows
## #
## # Edge Data: 802,651 × 4
## from to quantity price
## <int> <int> <dbl> <dbl>
## 1 1 4622 12 6.95
## 2 2 4622 12 6.75
## 3 3 4622 12 6.75
## # … with 802,648 more rows
We now want to check the sizes of the disjoint components of this graph.
basket_tblgraph %>%
as_tibble() %>%
filter(node_type == "stock") %>%
count(component_id, name = "stock_count", sort = TRUE)
## # A tibble: 6 × 2
## component_id stock_count
## <int> <int>
## 1 1 4616
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
We see that almost all the stock codes are contained in one large component, so we confine the rest of this analysis to it.
run_subgraph_clusters <- function(graph_cluster_func, labelling, input_tblgraph, ...) {
message(glue("Clustering the graph using {labelling}..."))
subgraph_clusters_tbl <- input_tblgraph %>%
mutate(
cluster_id = graph_cluster_func(...)
) %>%
activate(nodes) %>%
as_tibble() %>%
filter(node_type == "stock") %>%
count(cluster_id, name = "cluster_size", sort = TRUE) %>%
mutate(
cluster_id = factor(1:n(), levels = 1:n())
)
return(subgraph_clusters_tbl)
}
cluster_func <- c(
"group_fast_greedy",
"group_infomap",
"group_leading_eigen",
"group_louvain"
)
largecomp_tblgraph <- basket_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph)
cluster_data_tbl <- tibble(cluster_func_name = cluster_func) %>%
mutate(
cluster_func = map(cluster_func_name, get),
clustered = map2(
cluster_func, cluster_func_name,
run_subgraph_clusters,
input_tblgraph = largecomp_tblgraph
)
) %>%
select(cluster_func_name, clustered) %>%
unnest(clustered)
cluster_data_tbl %>% glimpse()
## Rows: 217
## Columns: 3
## $ cluster_func_name <chr> "group_fast_greedy", "group_fast_greedy", "group_fas…
## $ cluster_id <fct> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
## $ cluster_size <int> 1659, 1365, 833, 753, 3, 1, 1, 1, 3778, 54, 49, 25, …
Having created a summary of the data splits, we now want to construct a visualisation of how the various cluster routines split the data.
To do this, we turn the size of each cluster into a ‘label’ and then count how many clusters of that size there are. We then use this summary data to construct barplots of the size.
plot_tbl <- cluster_data_tbl %>%
group_by(cluster_func_name) %>%
count(cluster_size, name = "cluster_count", sort = TRUE) %>%
ungroup() %>%
mutate(cluster_size = as.factor(cluster_size))
ggplot(plot_tbl) +
geom_col(aes(x = cluster_size, y = cluster_count, group = cluster_size)) +
facet_wrap(vars(cluster_func_name), scales = "free") +
labs(
x = "Cluster Size",
y = "Community Count",
title = "Visualisation of Spread of Cluster Sizes"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
From this graphic, we see that group_louvain gives us the most even split across the data, though the cluster sizes are still very unequal.
3.2 Create Cluster-Based Allocation
We now use this algorithm to cluster the large component of the graph, and
this gives us an alternative allocation of each stock_code to a product
group.
largecomp_clustered_tbl <- largecomp_tblgraph %>%
mutate(
cluster_id = group_louvain()
) %>%
activate(nodes) %>%
as_tibble() %>%
filter(node_type == "stock") %>%
mutate(
cluster_group = sprintf("TNX_%03d", cluster_id)
) %>%
select(stock_code = node_label, cluster_group)
largecomp_clustered_tbl %>% glimpse()
## Rows: 4,616
## Columns: 2
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ cluster_group <chr> "TNX_005", "TNX_001", "TNX_001", "TNX_008", "TNX_003", "…
3.3 Combine Clustering Data
We now want to combine this data to construct our stock code allocations.
other_tbl <- basket_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(
node_type == "stock",
component_size != max(component_size)
) %>%
transmute(
stock_code = node_label, cluster_group = "TNX_010"
)
product_group_tnxgroups_tbl <- list(
largecomp_clustered_tbl,
other_tbl
) %>%
bind_rows() %>%
arrange(stock_code) %>%
inner_join(product_data_tbl, by = "stock_code") %>%
select(stock_code, product_group = cluster_group, desc)
product_group_tnxgroups_tbl %>% glimpse()
## Rows: 4,621
## Columns: 3
## $ stock_code <chr> "10002", "10080", "10109", "10120", "10123C", "10123G", …
## $ product_group <chr> "TNX_004", "TNX_002", "TNX_003", "TNX_004", "TNX_004", "…
## $ desc <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATABLE"…
4 Construct RFM Customer Segments
We now wish to repeat our RFM analysis, and then we reassign the customer base to each of these groupings.
segment_names <- c(
"Champions", "Loyal Customers", "Potential Loyalist", "New Customers",
"Promising", "Need Attention", "About To Sleep", "At Risk",
"Can't Lose Them", "Lost"
)
recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
segment_defs_tbl <- tibble(
segment_names,
recency_lower,
recency_upper,
frequency_lower,
frequency_upper,
monetary_lower,
monetary_upper
)
segment_defs_tbl %>% glimpse()
## Rows: 10
## Columns: 7
## $ segment_names <chr> "Champions", "Loyal Customers", "Potential Loyalist", …
## $ recency_lower <dbl> 4, 2, 3, 4, 3, 2, 2, 1, 1, 1
## $ recency_upper <dbl> 5, 5, 5, 5, 4, 3, 3, 2, 1, 2
## $ frequency_lower <dbl> 4, 3, 1, 1, 1, 2, 1, 2, 4, 1
## $ frequency_upper <dbl> 5, 5, 3, 1, 1, 3, 2, 5, 5, 2
## $ monetary_lower <dbl> 4, 3, 1, 1, 1, 2, 1, 2, 4, 1
## $ monetary_upper <dbl> 5, 5, 3, 1, 1, 3, 2, 5, 5, 2
We first visually inspect these segment definitions and the bands.
segments_show_tbl <- segment_defs_tbl %>%
mutate(
recency = glue("{recency_lower}-{recency_upper}") %>% as.character(),
frequency = glue("{frequency_lower}-{frequency_upper}") %>% as.character(),
monetary = glue("{monetary_lower}-{monetary_upper}") %>% as.character()
) %>%
select(
segment_names, recency, frequency, monetary
)
segments_show_tbl %>%
datatable(
colnames = c("Segment", "R", "F", "M"),
options = list(
columnDefs = list(list(className = 'dt-left', targets = 0:4))
)
)
We now construct the RFM data from the purchase data and assign each of the customers to a segment based on their RFM score.
A substantial number of transactions have a missing customer_id, so
we exclude these from the analysis.
customer_rfmdata <- tnx_purchase_tbl %>%
filter(
!are_na(customer_id),
invoice_date <= training_data_date
) %>%
group_by(invoice_date, customer_id) %>%
summarise(
.groups = "drop",
total_spend = sum(stock_value)
) %>%
rfm_table_order(
customer_id = customer_id,
order_date = invoice_date,
revenue = total_spend,
analysis_date = training_data_date
)
customer_rfmdata %>% print()
## # A tibble: 4,691 × 9
## customer_id date_most_recent recency_days transaction_count amount
## <chr> <date> <dbl> <dbl> <dbl>
## 1 12346 2011-01-18 72 3 77353.
## 2 12347 2011-01-26 64 3 2510.
## 3 12348 2011-01-25 65 3 1061.
## 4 12349 2010-10-28 154 2 2221.
## 5 12350 2011-02-02 57 1 294.
## 6 12351 2010-11-29 122 1 301.
## 7 12352 2011-03-22 9 6 985.
## 8 12353 2010-10-27 155 1 318.
## 9 12355 2010-05-21 314 1 488.
## 10 12356 2011-01-18 72 4 5074.
## # … with 4,681 more rows, and 4 more variables: recency_score <int>,
## # frequency_score <int>, monetary_score <int>, rfm_score <dbl>
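The recency_score, frequency_score and monetary_score columns are 1-5 quantile scores. A hedged base-R sketch of how such scores are typically derived (not necessarily the exact binning the rfm package uses): rank each measure and cut the ranks into five equal-sized bins, with recency reversed since more recent is better.

```r
# Score a numeric measure into quintiles 1-5; reverse for recency,
# where a *smaller* value (fewer days since last order) is better.
score_quintile <- function(x, reverse = FALSE) {
  r <- rank(x, ties.method = "first")
  s <- ceiling(5 * r / length(x))
  if (reverse) 6L - s else s
}

# Toy per-customer measures (invented for illustration).
recency_days <- c(5, 40, 90, 200, 350)
spend        <- c(50, 120, 400, 900, 5000)

recency_score  <- score_quintile(recency_days, reverse = TRUE)
monetary_score <- score_quintile(spend)
recency_score   # 5 4 3 2 1
monetary_score  # 1 2 3 4 5
```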
4.1 Visualise RFM Data
As we explored earlier, the rfm package provides a number of inbuilt
descriptive visualisations.
First we look at the count of customers at each order count:
customer_rfmdata %>%
rfm_order_dist(print_plot = FALSE)
We also have a few summary plots, showing histograms of the recency, frequency and monetary measures.
customer_rfmdata %>%
rfm_histograms(print_plot = FALSE) +
scale_x_continuous(labels = label_comma()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Finally, we look at each of the three bivariate plots to explore the relationship between the three quantities.
customer_rfmdata %>%
rfm_rm_plot(print_plot = FALSE) +
scale_x_log10(labels = label_comma()) +
scale_y_log10(labels = label_comma())
customer_rfmdata %>%
rfm_rf_plot(print_plot = FALSE) +
scale_x_log10(labels = label_comma()) +
scale_y_log10(labels = label_comma())
customer_rfmdata %>%
rfm_fm_plot(print_plot = FALSE) +
scale_x_log10(labels = label_comma()) +
scale_y_log10(labels = label_comma())
4.2 Assign Customer Segments
We now assign each customer to a segment and this allows us to analyse each of the segments.
customer_segments_tbl <- customer_rfmdata %>%
rfm_segment(
segment_names = segment_names,
recency_lower = recency_lower,
recency_upper = recency_upper,
frequency_lower = frequency_lower,
frequency_upper = frequency_upper,
monetary_lower = monetary_lower,
monetary_upper = monetary_upper
)
customer_segments_tbl %>% glimpse()
## Rows: 4,691
## Columns: 9
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351"…
## $ segment <chr> "Loyal Customers", "Loyal Customers", "Loyal Custome…
## $ rfm_score <dbl> 435, 435, 434, 224, 412, 312, 543, 212, 112, 445, 31…
## $ transaction_count <dbl> 3, 3, 3, 2, 1, 1, 6, 1, 1, 4, 1, 3, 8, 3, 4, 1, 1, 1…
## $ recency_days <dbl> 72, 64, 65, 154, 57, 122, 9, 155, 314, 72, 135, 122,…
## $ amount <dbl> 77352.96, 2510.50, 1061.40, 2221.14, 294.40, 300.93,…
## $ recency_score <int> 4, 4, 4, 2, 4, 3, 5, 2, 1, 4, 3, 3, 4, 3, 4, 4, 4, 1…
## $ frequency_score <int> 3, 3, 3, 2, 1, 1, 4, 1, 1, 4, 1, 3, 5, 3, 4, 1, 1, 1…
## $ monetary_score <int> 5, 5, 4, 4, 2, 2, 3, 2, 2, 5, 5, 5, 5, 4, 2, 2, 2, 2…
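As a sanity check on what rfm_segment() has done, here is a minimal base-R sketch of the band-matching logic: a customer receives the first segment whose R/F/M bands contain their three scores. The first-match order and the "Others" fallback are assumptions of this sketch, not a guarantee of rfm's internals.

```r
# Segment definitions, as set up earlier in this analysis.
segment_names <- c(
  "Champions", "Loyal Customers", "Potential Loyalist", "New Customers",
  "Promising", "Need Attention", "About To Sleep", "At Risk",
  "Can't Lose Them", "Lost"
)
recency_lower   <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper   <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper  <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)

# Return the first segment whose three bands all contain the scores.
assign_segment <- function(r, f, m) {
  hit <- which(
    r >= recency_lower   & r <= recency_upper &
    f >= frequency_lower & f <= frequency_upper &
    m >= monetary_lower  & m <= monetary_upper
  )
  if (length(hit) == 0) return("Others")
  segment_names[hit[1]]
}

assign_segment(5, 5, 5)  # "Champions"
assign_segment(1, 1, 1)  # "Lost"
```

Because the bands overlap, the ordering of the segment definitions matters: a 4-3-5 customer, for example, matches "Loyal Customers" before any later band.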
We want to plot the count of each of the customer segments, before we calculate the various summary statistics.
customer_segment_count_tbl <- customer_segments_tbl %>%
count(segment, name = "count", sort = TRUE)
ggplot(customer_segment_count_tbl) +
geom_col(aes(x = segment, y = count, fill = segment)) +
scale_fill_brewer(type = "qual", palette = "Set1") +
labs(
x = "Segment",
y = "Count"
) +
theme(
axis.text.x = element_text(angle = 20, vjust = 0.5),
legend.position = "none"
)
Again, rfm provides a number of inbuilt plots of the segments, but as these
are quite basic we create the plots ourselves, which lets us summarise the
segments however we wish.
plot_tbl <- customer_segments_tbl %>%
select(
customer_id, segment,
Transactions = transaction_count,
Recency = recency_days,
`Total Spend` = amount
) %>%
pivot_longer(
!c(customer_id, segment),
names_to = "quantity",
values_to = "value"
) %>%
mutate(
value = pmax(0.1, value)
)
ggplot(plot_tbl) +
geom_boxplot(aes(x = segment, y = value, fill = segment)) +
expand_limits(y = 0.1) +
facet_wrap(vars(quantity), ncol = 2, scales = "free_y") +
scale_y_log10(labels = label_comma()) +
scale_fill_brewer(type = "qual", palette = "Set1") +
labs(
x = "Customer Segment",
y = "Value"
) +
theme(
axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8),
legend.position = "none"
)
4.3 Inspect Segment Validation
Now that we have assigned customers to segments, we use these segments to
assess the transactions made after the cutoff date training_data_date.
segments_alloc_tbl <- customer_segments_tbl %>%
select(customer_id, segment)
daily_spend_tbl <- tnx_purchase_tbl %>%
filter(invoice_date > training_data_date) %>%
group_by(invoice_date, customer_id) %>%
summarise(
.groups = "drop",
daily_spend = sum(stock_value)
) %>%
left_join(segments_alloc_tbl, by = "customer_id") %>%
replace_na(list(segment = "New Customer"))
daily_spend_tbl %>% glimpse()
## Rows: 12,376
## Columns: 4
## $ invoice_date <date> 2011-04-01, 2011-04-01, 2011-04-01, 2011-04-01, 2011-04-…
## $ customer_id <chr> "12523", "12569", "12731", "12748", "12877", "12949", "13…
## $ daily_spend <dbl> 138.75, 222.68, 252.00, 25.50, 87.10, 1534.30, 1317.80, 6…
## $ segment <chr> "Champions", "New Customer", "Champions", "Champions", "C…
Having constructed this data, we now calculate some per-customer summary statistics.
postdate_customer_stats_tbl <- daily_spend_tbl %>%
group_by(customer_id, segment) %>%
summarise(
.groups = "drop",
total_transactions = n(),
total_spend = sum(daily_spend)
)
postdate_customer_stats_tbl %>% glimpse()
## Rows: 3,843
## Columns: 4
## $ customer_id <chr> "12347", "12348", "12349", "12352", "12353", "12354…
## $ segment <chr> "Loyal Customers", "Loyal Customers", "At Risk", "L…
## $ total_transactions <int> 5, 2, 1, 3, 1, 1, 1, 2, 1, 2, 2, 3, 9, 2, 4, 1, 1, …
## $ total_spend <dbl> 3122.82, 597.00, 1457.55, 744.23, 89.00, 1079.40, 4…
First we compare the segment counts before and after the cutoff date. The first metric to check is each segment's proportion of the dataset; we exclude newly arrived customers from the post-cutoff data to enable a direct comparison.
pre_data_tbl <- customer_segment_count_tbl %>%
mutate(prop = count / sum(count))
post_data_tbl <- postdate_customer_stats_tbl %>%
filter(segment != "New Customer") %>%
count(segment, name = "count", sort = TRUE) %>%
mutate(prop = count / sum(count))
comparison_tbl <- list(
Training = pre_data_tbl,
Validation = post_data_tbl
) %>%
bind_rows(.id = "data")
ggplot(comparison_tbl) +
geom_col(aes(x = segment, y = prop, fill = data), position = "dodge") +
labs(
x = "Segment",
y = "Segment Proportion",
fill = "Dataset"
) +
scale_fill_brewer(type = "qual", palette = "Dark2") +
theme(axis.text.x = element_text(angle = 20, vjust = 0.5))
Next, we check our RFM stats against the validation data as of the final date in the dataset.
First we construct the data and then construct boxplots of each segment.
validation_date <- daily_spend_tbl %>% pull(invoice_date) %>% max()
validation_rfm_data_tbl <- daily_spend_tbl %>%
group_by(segment, customer_id) %>%
summarise(
.groups = "drop",
transaction_count = n(),
recency_days = (validation_date - max(invoice_date)) %>% as.numeric(),
amount = sum(daily_spend)
)
validation_rfm_data_tbl %>% glimpse()
## Rows: 3,843
## Columns: 5
## $ segment <chr> "About To Sleep", "About To Sleep", "About To Sleep"…
## $ customer_id <chr> "12353", "12445", "12457", "12475", "12519", "12787"…
## $ transaction_count <int> 1, 1, 6, 1, 1, 3, 1, 1, 2, 5, 2, 1, 1, 1, 3, 1, 1, 3…
## $ recency_days <dbl> 204, 22, 66, 53, 63, 9, 192, 96, 84, 4, 4, 196, 64, …
## $ amount <dbl> 89.00, 77.40, 2043.23, 669.38, 286.84, 418.66, 294.9…
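The recency_days values above come from simple date arithmetic: subtracting two Date objects yields a difftime, which as.numeric() converts to a number of days. A quick base-R sketch (the dates here are invented for illustration):

```r
# Dates invented for illustration; subtracting Date objects
# yields a "difftime", converted to days with as.numeric()
validation_date <- as.Date("2011-12-09")
last_purchase   <- as.Date("2011-05-19")

recency_days <- as.numeric(validation_date - last_purchase)
recency_days  # 204
```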
Having constructed the table, we view the data as an interactive JS datatable.
validation_rfm_data_tbl %>% datatable()We now produce boxplots of the three metrics using these segments, and also look at the new customers as a separate category.
plot_tbl <- validation_rfm_data_tbl %>%
select(
customer_id, segment,
Transactions = transaction_count,
Recency = recency_days,
`Total Spend` = amount
) %>%
pivot_longer(
!c(customer_id, segment),
names_to = "quantity",
values_to = "value"
) %>%
mutate(
value = pmax(0.1, value)
)
ggplot(plot_tbl) +
geom_boxplot(aes(x = segment, y = value, fill = segment)) +
expand_limits(y = 0.1) +
facet_wrap(vars(quantity), ncol = 2, scales = "free_y") +
scale_y_log10(labels = label_comma()) +
scale_fill_brewer(type = "qual", palette = "Set1") +
labs(
x = "Customer Segment",
y = "Value"
) +
theme(
axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8),
legend.position = "none"
)As an alternative plot, we also look at violin plots rather than boxplots.
ggplot(plot_tbl) +
geom_violin(aes(x = segment, y = value, fill = segment)) +
expand_limits(y = 0.1) +
facet_wrap(vars(quantity), ncol = 2, scales = "free_y") +
scale_y_log10(labels = label_comma()) +
scale_fill_brewer(type = "qual", palette = "Set1") +
labs(
x = "Customer Segment",
y = "Value"
) +
theme(
axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8),
legend.position = "none"
)We certainly see that our “Champions” are the ‘best’ segment in terms of the three metrics we measure. That said, our new customers also appear to have desirable metrics on aggregate.
4.4 Simple Product Description Analysis
With each of the products classified into different groups we now inspect the product descriptions of those products.
We split the product descriptions by product_group and construct a word cloud
for each one. Word clouds are a simple starting point to help visualise common
words in a group.
create_wordcloud <- function(tokens_tbl, seed = 42) {
cloud_plot <- tokens_tbl %>%
count(word, name = "freq", sort = TRUE) %>%
slice_max(order_by = freq, n = 100) %>%
ggwordcloud2(seed = seed)
return(cloud_plot)
}
product_group_tokens_tbl <- product_group_tnxgroups_tbl %>%
group_by(product_group) %>%
mutate(product_count = n()) %>%
mutate(label = glue("{product_group} ({product_count})")) %>%
ungroup() %>%
unnest_tokens(word, desc)
w_tbl <- product_group_tokens_tbl %>%
filter(product_group != "TNX_010") %>%
group_nest(label, product_count, .key = "token_data") %>%
mutate(
cloud = map2(token_data, product_count, create_wordcloud)
)
plot_grid(
plotlist = w_tbl$cloud,
labels = w_tbl$label,
ncol = 3
)We now repeat this exercise, this time working with the word cloud data at a slightly lower level.
wc_input_tbl <- product_group_tokens_tbl %>%
count(label, word, name = "freq", sort = TRUE) %>%
group_by(label) %>%
slice_max(order_by = freq, n = 50) %>%
ungroup()
ggplot(wc_input_tbl %>% filter(label != "TNX_010 (6)")) +
geom_text_wordcloud_area(
aes(label = word, size = freq),
shape = "square", seed = 42, rm_outside = TRUE) +
facet_wrap(vars(label), scales = "free", ncol = 3) +
scale_size_area(max_size = 15) +
theme_minimal()5 Analyse Categorical Co-occurrences
We now need to perform a correspondence analysis (CA) on the co-occurrence of customer grouping and product grouping to see what types of products are associated with the various groups.
5.1 Construct Customer and Product Allocations
We need to construct a lookup table for all customers in our dataset. Customers who first appeared in our ‘validation’ data are assigned the grouping “New Customer”.
customer_segment_allocation_tbl <- tnx_purchase_tbl %>%
drop_na(customer_id) %>%
select(customer_id) %>%
distinct() %>%
left_join(segments_alloc_tbl, by = "customer_id") %>%
replace_na(list(segment = "New Customer")) %>%
arrange(customer_id)
customer_segment_allocation_tbl %>% glimpse()## Rows: 5,852
## Columns: 2
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", "123…
## $ segment <chr> "Loyal Customers", "Loyal Customers", "Loyal Customers", "…
We also construct a lookup table allocating each stock_code to a
product_group_id.
product_group_allocation_tbl <- product_data_full_tbl %>%
select(stock_code, product_group_id) %>%
arrange(stock_code)
product_group_allocation_tbl %>% glimpse()## Rows: 4,621
## Columns: 2
## $ stock_code <chr> "10002", "10080", "10109", "10120", "10123C", "10123G…
## $ product_group_id <chr> "AR_DISJOINT_053", "none", "none", "none", "none", "n…
5.2 Construct Co-occurrence Analysis
We now wish to add the various product and customer groupings to our transaction data to construct a table we can later use for some correspondence analysis. Before turning to library routines to perform this analysis, we first look at some of the co-occurrence frequencies.
tnx_correspondence_tbl <- tnx_purchase_tbl %>%
select(
invoice_id, invoice_date, stock_code, customer_id
) %>%
mutate(
tnx_ym = invoice_date %>% format("%Y%m"),
tnx_dow = invoice_date %>% format("%A"),
tnx_month = invoice_date %>% format("%B"),
tnx_qtr = invoice_date %>% as.yearqtr() %>% as.character()
) %>%
inner_join(customer_cohort_tbl, by = "customer_id") %>%
inner_join(customer_segment_allocation_tbl, by = "customer_id") %>%
inner_join(product_group_tnxgroups_tbl, by = "stock_code")
tnx_correspondence_tbl %>% glimpse()## Rows: 802,651
## Columns: 15
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064"…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", …
## $ tnx_ym <chr> "200912", "200912", "200912", "200912", "200912", "200…
## $ tnx_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",…
## $ tnx_month <chr> "December", "December", "December", "December", "Decem…
## $ tnx_qtr <chr> "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4",…
## $ cohort_qtr <chr> "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4",…
## $ cohort_ym <chr> "2009 12", "2009 12", "2009 12", "2009 12", "2009 12",…
## $ first_tnx_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-…
## $ total_tnx_count <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 55, 55, 55, 55, 55…
## $ segment <chr> "Champions", "Champions", "Champions", "Champions", "C…
## $ product_group <chr> "TNX_005", "TNX_001", "TNX_001", "TNX_008", "TNX_003",…
## $ desc <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LI…
We now want to do some basic analysis based on the relative frequencies of the values of both the product groups and the customer segments.
segment_group_contingency_tbl <- tnx_correspondence_tbl %>%
filter(
invoice_date <= training_data_date
) %>%
count(segment, product_group, name = "group_count") %>%
mutate(
obs_prop = group_count / sum(group_count)
)
segment_group_contingency_tbl %>% glimpse()## Rows: 75
## Columns: 4
## $ segment <chr> "About To Sleep", "About To Sleep", "About To Sleep", "A…
## $ product_group <chr> "TNX_001", "TNX_002", "TNX_003", "TNX_004", "TNX_005", "…
## $ group_count <int> 973, 270, 1045, 1778, 1433, 265, 525, 129, 161, 5232, 45…
## $ obs_prop <dbl> 1.944406e-03, 5.395576e-04, 2.088288e-03, 3.553086e-03, …
segment_props_tbl <- segment_group_contingency_tbl %>%
count(segment, wt = group_count, name = "segment_count") %>%
mutate(
segment_prop = segment_count / sum(segment_count)
)
prodgroup_props_tbl <- segment_group_contingency_tbl %>%
count(product_group, wt = group_count, name = "prodgroup_count") %>%
mutate(
prodgroup_prop = prodgroup_count / sum(prodgroup_count)
)
segment_group_independ_tbl <- expand_grid(
segment_props_tbl,
prodgroup_props_tbl
) %>%
transmute(
segment, product_group, segment_count, prodgroup_count,
independ_prop = segment_prop * prodgroup_prop
)
segment_group_independ_tbl %>% glimpse()## Rows: 80
## Columns: 5
## $ segment <chr> "About To Sleep", "About To Sleep", "About To Sleep", …
## $ product_group <chr> "TNX_001", "TNX_002", "TNX_003", "TNX_004", "TNX_005",…
## $ segment_count <int> 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, …
## $ prodgroup_count <int> 106681, 27351, 94400, 94937, 64105, 31001, 47590, 1560…
## $ independ_prop <dbl> 2.802819e-03, 7.185899e-04, 2.480161e-03, 2.494270e-03…
We now compare the observed proportions of these combinations against the “theoretical” proportions under the assumption of independence.
segment_group_combined_tbl <- segment_group_independ_tbl %>%
left_join(segment_group_contingency_tbl, by = c("segment", "product_group")) %>%
replace_na(list(group_count = 0, obs_prop = 0)) %>%
mutate(
ignore = (group_count == 0) | (independ_prop <= 1e-4),
diff = (obs_prop - independ_prop),
perc = if_else(ignore == TRUE, 0, (obs_prop / independ_prop) - 1)
)
segment_group_combined_tbl %>% glimpse()## Rows: 80
## Columns: 10
## $ segment <chr> "About To Sleep", "About To Sleep", "About To Sleep", …
## $ product_group <chr> "TNX_001", "TNX_002", "TNX_003", "TNX_004", "TNX_005",…
## $ segment_count <int> 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, …
## $ prodgroup_count <int> 106681, 27351, 94400, 94937, 64105, 31001, 47590, 1560…
## $ independ_prop <dbl> 2.802819e-03, 7.185899e-04, 2.480161e-03, 2.494270e-03…
## $ group_count <dbl> 973, 270, 1045, 1778, 1433, 265, 525, 129, 161, 0, 523…
## $ obs_prop <dbl> 1.944406e-03, 5.395576e-04, 2.088288e-03, 3.553086e-03…
## $ ignore <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ diff <dbl> -8.584131e-04, -1.790324e-04, -3.918737e-04, 1.058817e…
## $ perc <dbl> -0.306267796, -0.249144013, -0.158003300, 0.424499648,…
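These cell-by-cell comparisons can be complemented by a formal chi-squared test of independence on the contingency table, using base R's chisq.test(). A minimal sketch on a toy matrix (the counts below are invented for illustration; the real input would be the full segment-by-product-group counts):

```r
# Toy segment-by-product-group contingency matrix
# (counts invented for illustration)
toy_mat <- matrix(
  c( 973,  270, 1045,
     525,  129,  161,
    5232,  450,  300),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    segment       = c("About To Sleep", "At Risk", "Champions"),
    product_group = c("TNX_001", "TNX_002", "TNX_003")
  )
)

indep_test <- chisq.test(toy_mat)

# Standardised residuals show which cells deviate most from
# the independence assumption (positive = over-represented)
round(indep_test$stdres, 2)
```

The standardised residuals play the same role as the `diff` and `perc` columns above, but on a scale that accounts for sampling variation.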
Now that we have combined this data, we construct a heatmap of the proportions.
ggplot(segment_group_combined_tbl) +
geom_tile(aes(x = segment, y = product_group, fill = diff)) +
geom_text(aes(x = segment, y = product_group,
label = label_comma(accuracy = 1)(group_count))) +
labs(
x = "Customer Segment",
y = "Product Group",
fill = "Diff"
) +
scale_fill_gradient2(low = "blue", high = "red", midpoint = 0) +
ggtitle("Heatmap of Differences Between Observed and Theoretical Proportions") +
theme(axis.text.x = element_text(angle = 10, vjust = 0.5))We now make a similar plot, instead showing the relative percentage difference between the observed proportion and the theoretical proportion under the assumption of independence.
ggplot(segment_group_combined_tbl) +
geom_tile(aes(x = segment, y = product_group, fill = perc)) +
geom_text(aes(x = segment, y = product_group,
label = label_comma(accuracy = 1)(group_count))) +
labs(
x = "Customer Segment",
y = "Product Group",
fill = "Perc"
) +
scale_fill_gradient2(low = "blue", high = "red", midpoint = 0) +
ggtitle("Heatmap of Percentage Differences Between Observed and Theoretical Proportions") +
theme(axis.text.x = element_text(angle = 10, vjust = 0.5))5.3 Construct Balloon Plot
Another way to visualise this co-occurrence of values is to construct a balloon plot, which shows the counts of each combination of the categorical values within our dataset.
plot_tbl <- tnx_correspondence_tbl %>%
count(segment, product_group, name = "freq_count")
ggballoonplot(
plot_tbl,
show.label = TRUE,
ggtheme = theme_cowplot()
) +
labs(
x = "Customer Segment",
y = "Product Group",
size = "Count"
) +
theme(axis.text.x = element_text(angle = 20))5.4 Perform Basic Correspondence Analysis
To run the routines that calculate the correspondence analysis we need to convert our data into the matrix format required by the libraries.
To do this efficiently, we construct a quick function that converts these data tables into a matrix, with the first column becoming the row names of the matrix.
make_df_matrix <- function(data_tbl) {
seg_names <- data_tbl %>% pull(1)
data_mat <- data_tbl %>%
select(-1) %>%
as.matrix() %>%
set_rownames(seg_names)
return(data_mat)
}
segment_group_freq_tbl <- tnx_correspondence_tbl %>%
count(segment, product_group, name = "freq_count") %>%
pivot_wider(
id_cols = segment,
names_from = product_group,
values_from = freq_count,
values_fill = 0
)
segment_group_mat <- segment_group_freq_tbl %>%
make_df_matrix()
segment_group_mat %>% glimpse()## int [1:9, 1:10] 1325 5637 77313 3448 29463 2394 6587 3343 8660 1271 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:9] "About To Sleep" "At Risk" "Champions" "Lost" ...
## ..$ : chr [1:10] "TNX_001" "TNX_002" "TNX_003" "TNX_004" ...
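The make_df_matrix() helper leans on dplyr and magrittr; the same conversion can be sketched in base R, which also makes the idea easy to check on a toy data frame (toy values invented for illustration):

```r
# Base-R equivalent of make_df_matrix(): the first column
# becomes the row names of the resulting matrix
make_df_matrix_base <- function(data_df) {
  data_mat <- as.matrix(data_df[, -1, drop = FALSE])
  rownames(data_mat) <- data_df[[1]]
  data_mat
}

toy_df <- data.frame(
  segment = c("A", "B"),
  TNX_001 = c(1L, 2L),
  TNX_002 = c(3L, 4L)
)

toy_mat <- make_df_matrix_base(toy_df)
toy_mat["B", "TNX_002"]  # 4
```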
Now that we have the frequency matrix, we can run the CA routine and visualise the outputs.
segment_group_ca <- segment_group_mat %>% CA(graph = FALSE)First we visualise the variance explained by each dimension, as determined from the eigenvalues of the matrix. This gives us an indication of the ‘true’ underlying dimensionality of the data.
segment_group_ca %>% fviz_eig()From this plot, we see that we can capture almost 94% of the variance with just two dimensions.
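The variance percentages shown by fviz_eig() are the squared singular values (principal inertias) of the matrix of standardised residuals; CA is essentially an SVD of that matrix. A base-R sketch on a toy frequency matrix (counts invented for illustration):

```r
# Toy frequency matrix (counts invented for illustration)
N <- matrix(c(20,  5, 10,
               3, 15,  6,
               8,  2, 30),
            nrow = 3, byrow = TRUE)

P     <- N / sum(N)   # correspondence matrix
row_m <- rowSums(P)   # row masses
col_m <- colSums(P)   # column masses

# Standardised residuals: deviation from independence,
# scaled by the row and column masses
S <- diag(1 / sqrt(row_m)) %*%
  (P - row_m %*% t(col_m)) %*%
  diag(1 / sqrt(col_m))

inertias <- svd(S)$d^2   # principal inertias (CA eigenvalues)

# Percentage of variance captured by each dimension
round(100 * inertias / sum(inertias), 1)
```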
We then construct the biplot - which is a way of visualising the co-occurrences of each of the categorical values.
segment_group_ca %>%
fviz_ca_biplot(
repel = TRUE,
title = "CA Biplot of Customer Segment Against Product Group"
)According to the biplot, there is a suggested association between customers in the “Champions” category and products in grouping “TNX_009” - we ignore “TNX_010” as it is very small.
As before, we look at a word cloud of the types of items in that group.
wc_009_tbl <- product_group_tokens_tbl %>%
filter(product_group == "TNX_009") %>%
count(word, name = "freq") %>%
slice_max(order_by = freq, n = 100)
wc_plot <- ggwordcloud2(
data = wc_009_tbl,
shuffle = FALSE,
size = 4,
seed = 42421
)
wc_plot %>% plot()We also see that customers designated “Potential Loyalist” are associated with “TNX_005”, so we show a word cloud for this group as well.
wc_005_tbl <- product_group_tokens_tbl %>%
filter(product_group == "TNX_005") %>%
count(word, name = "freq") %>%
slice_max(order_by = freq, n = 100)
wc_plot <- ggwordcloud2(
data = wc_005_tbl,
shuffle = FALSE,
size = 4,
seed = 42422
)
wc_plot %>% plot()6 R Environment
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.1.0 (2021-05-18)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-10-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.1.0)
## arules * 1.6-8 2021-05-17 [1] RSPM (R 4.1.0)
## arulesViz * 1.5-0 2021-05-21 [1] RSPM (R 4.1.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.1.0)
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.1.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.1.0)
## bookdown 0.22 2021-04-22 [1] RSPM (R 4.1.0)
## broom 0.7.9 2021-07-27 [1] RSPM (R 4.1.0)
## bslib 0.2.5.1 2021-05-18 [1] RSPM (R 4.1.0)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.1.0)
## car 3.0-11 2021-06-27 [1] RSPM (R 4.1.0)
## carData 3.0-4 2020-05-22 [1] RSPM (R 4.1.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.1.0)
## cli 3.0.1 2021-07-17 [1] RSPM (R 4.1.0)
## cluster 2.1.2 2021-04-17 [2] CRAN (R 4.1.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.0)
## colorspace 2.0-2 2021-06-24 [1] RSPM (R 4.1.0)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.1.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.1.0)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.1.0)
## crosstalk 1.1.1 2021-01-12 [1] RSPM (R 4.1.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.1.0)
## data.table 1.14.0 2021-02-21 [1] RSPM (R 4.1.0)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.1.0)
## dbplyr 2.1.1 2021-04-06 [1] RSPM (R 4.1.0)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.1.0)
## dplyr * 1.0.7 2021-06-18 [1] RSPM (R 4.1.0)
## DT * 0.18 2021-04-14 [1] RSPM (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.1.0)
## factoextra * 1.0.7 2020-04-01 [1] RSPM (R 4.1.0)
## FactoMineR * 2.4 2020-12-11 [1] RSPM (R 4.1.0)
## fansi 0.5.0 2021-05-25 [1] RSPM (R 4.1.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0)
## flashClust 1.01-2 2012-08-21 [1] RSPM (R 4.1.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.1.0)
## foreign 0.8-81 2020-12-22 [2] CRAN (R 4.1.0)
## fs 1.5.0 2020-07-31 [1] RSPM (R 4.1.0)
## furrr * 0.2.3 2021-06-25 [1] RSPM (R 4.1.0)
## future * 1.21.0 2020-12-10 [1] RSPM (R 4.1.0)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.1.0)
## ggplot2 * 3.3.5 2021-06-25 [1] RSPM (R 4.1.0)
## ggpubr * 0.4.0 2020-06-27 [1] RSPM (R 4.1.0)
## ggrepel 0.9.1 2021-01-15 [1] RSPM (R 4.1.0)
## ggsignif 0.6.2 2021-06-14 [1] RSPM (R 4.1.0)
## ggwordcloud * 0.5.0 2019-06-02 [1] RSPM (R 4.1.0)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.1.0)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.1.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.1.0)
## haven 2.4.3 2021-08-04 [1] RSPM (R 4.1.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.1.0)
## hms 1.1.0 2021-05-17 [1] RSPM (R 4.1.0)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.1.0)
## htmlwidgets 1.5.3 2020-12-10 [1] RSPM (R 4.1.0)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.1.0)
## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.1.0)
## janeaustenr 0.1.5 2017-06-10 [1] RSPM (R 4.1.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.1.0)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.1.0)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.1.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.1.0)
## lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
## leaps 3.1 2020-01-16 [1] RSPM (R 4.1.0)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.1.0)
## lubridate 1.7.10 2021-02-26 [1] RSPM (R 4.1.0)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.1.0)
## MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.0)
## Matrix * 1.3-3 2021-05-04 [2] CRAN (R 4.1.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.1.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.1.0)
## openxlsx 4.2.4 2021-06-16 [1] RSPM (R 4.1.0)
## parallelly 1.27.0 2021-07-19 [1] RSPM (R 4.1.0)
## pillar 1.6.2 2021-07-29 [1] RSPM (R 4.1.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
## png 0.1-7 2013-12-03 [1] RSPM (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.1.0)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.1.0)
## Rcpp 1.0.7 2021-07-07 [1] RSPM (R 4.1.0)
## readr * 2.0.0 2021-07-20 [1] RSPM (R 4.1.0)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.1.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
## rfm * 0.2.2 2020-07-21 [1] RSPM (R 4.1.0)
## rio 0.5.27 2021-06-21 [1] RSPM (R 4.1.0)
## rlang * 0.4.11 2021-04-30 [1] RSPM (R 4.1.0)
## rmarkdown 2.10 2021-08-06 [1] RSPM (R 4.1.0)
## rmdformats 1.0.2 2021-04-19 [1] RSPM (R 4.1.0)
## rstatix 0.7.0 2021-02-13 [1] RSPM (R 4.1.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0)
## rvest 1.0.1 2021-07-26 [1] RSPM (R 4.1.0)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.1.0)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.1.0)
## scatterplot3d 0.3-41 2018-03-14 [1] RSPM (R 4.1.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.1.0)
## SnowballC 0.7.0 2020-04-01 [1] RSPM (R 4.1.0)
## stringi 1.7.3 2021-07-16 [1] RSPM (R 4.1.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.1.0)
## tibble * 3.1.3 2021-07-23 [1] RSPM (R 4.1.0)
## tidygraph * 1.2.0 2020-05-12 [1] RSPM (R 4.1.0)
## tidyr * 1.1.3 2021-03-03 [1] RSPM (R 4.1.0)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0)
## tidytext * 0.3.1 2021-04-10 [1] RSPM (R 4.1.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.1.0)
## tokenizers 0.2.1 2018-03-29 [1] RSPM (R 4.1.0)
## tzdb 0.1.2 2021-07-20 [1] RSPM (R 4.1.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.1.0)
## visNetwork 2.0.9 2019-12-06 [1] RSPM (R 4.1.0)
## vroom 1.5.4 2021-08-05 [1] RSPM (R 4.1.0)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.1.0)
## wordcloud2 * 0.2.1 2018-01-03 [1] RSPM (R 4.1.0)
## xfun 0.25 2021-08-06 [1] RSPM (R 4.1.0)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.1.0)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.1.0)
## zip 2.2.0 2021-05-31 [1] RSPM (R 4.1.0)
## zoo * 1.8-9 2021-03-09 [1] RSPM (R 4.1.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library